Heart disease is a broad term for conditions that affect the heart's structure and function, including coronary artery disease, heart rhythm disorders (arrhythmias), heart valve defects, and congenital heart defects. It can develop over time due to factors such as high blood pressure, high cholesterol, smoking, diabetes, obesity, a sedentary lifestyle, and family history.

A heart attack, on the other hand, is a specific medical event that occurs when blood flow to a part of the heart is blocked or severely reduced, leading to damage or death of the heart muscle tissue. This blockage usually occurs due to the rupture of a plaque (a buildup of cholesterol and other substances) in the coronary arteries, which supply oxygen-rich blood to the heart muscle. When the blood flow is interrupted, the affected part of the heart muscle is deprived of oxygen and nutrients, causing tissue damage.

In summary:

Heart disease is a general term that refers to various conditions affecting the heart. A heart attack is a specific event that occurs when blood flow to a part of the heart is blocked, leading to damage or death of the heart muscle tissue. Heart disease can increase the risk of experiencing a heart attack, but not all heart disease patients will necessarily have a heart attack. There are various types of heart disease, and each may have different symptoms and treatment approaches.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm
import numpy as np
#!pip install interpret
import interpret
from interpret.glassbox import LogisticRegression


from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

#!pip install causalinference
from causalinference import CausalModel
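# NOTE: this rebinds the name LogisticRegression, shadowing interpret.glassbox.LogisticRegression imported above
from sklearn.linear_model import LogisticRegression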
from sklearn.neighbors import NearestNeighbors
import plotly.express as px
from pywaffle.waffle import Waffle
import shutil
from os import path
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
In [2]:
df = pd.read_csv('heart_2022_no_nans.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246022 non-null  object 
 1   Sex                        246022 non-null  object 
 2   GeneralHealth              246022 non-null  object 
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object 
 6   PhysicalActivities         246022 non-null  object 
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object 
 9   HadHeartAttack             246022 non-null  object 
 10  HadAngina                  246022 non-null  object 
 11  HadStroke                  246022 non-null  object 
 12  HadAsthma                  246022 non-null  object 
 13  HadSkinCancer              246022 non-null  object 
 14  HadCOPD                    246022 non-null  object 
 15  HadDepressiveDisorder      246022 non-null  object 
 16  HadKidneyDisease           246022 non-null  object 
 17  HadArthritis               246022 non-null  object 
 18  HadDiabetes                246022 non-null  object 
 19  DeafOrHardOfHearing        246022 non-null  object 
 20  BlindOrVisionDifficulty    246022 non-null  object 
 21  DifficultyConcentrating    246022 non-null  object 
 22  DifficultyWalking          246022 non-null  object 
 23  DifficultyDressingBathing  246022 non-null  object 
 24  DifficultyErrands          246022 non-null  object 
 25  SmokerStatus               246022 non-null  object 
 26  ECigaretteUsage            246022 non-null  object 
 27  ChestScan                  246022 non-null  object 
 28  RaceEthnicityCategory      246022 non-null  object 
 29  AgeCategory                246022 non-null  object 
 30  HeightInMeters             246022 non-null  float64
 31  WeightInKilograms          246022 non-null  float64
 32  BMI                        246022 non-null  float64
 33  AlcoholDrinkers            246022 non-null  object 
 34  HIVTesting                 246022 non-null  object 
 35  FluVaxLast12               246022 non-null  object 
 36  PneumoVaxEver              246022 non-null  object 
 37  TetanusLast10Tdap          246022 non-null  object 
 38  HighRiskLastYear           246022 non-null  object 
 39  CovidPos                   246022 non-null  object 
dtypes: float64(6), object(34)
memory usage: 75.1+ MB
In [3]:
pd.set_option('display.max_columns', None)
df.head()
Out[3]:
State Sex GeneralHealth PhysicalHealthDays MentalHealthDays LastCheckupTime PhysicalActivities SleepHours RemovedTeeth HadHeartAttack HadAngina HadStroke HadAsthma HadSkinCancer HadCOPD HadDepressiveDisorder HadKidneyDisease HadArthritis HadDiabetes DeafOrHardOfHearing BlindOrVisionDifficulty DifficultyConcentrating DifficultyWalking DifficultyDressingBathing DifficultyErrands SmokerStatus ECigaretteUsage ChestScan RaceEthnicityCategory AgeCategory HeightInMeters WeightInKilograms BMI AlcoholDrinkers HIVTesting FluVaxLast12 PneumoVaxEver TetanusLast10Tdap HighRiskLastYear CovidPos
0 Alabama Female Very good 4.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No No No No No No No No Yes No No No No No No No Former smoker Never used e-cigarettes in my entire life No White only, Non-Hispanic Age 65 to 69 1.60 71.67 27.99 No No Yes Yes Yes, received Tdap No No
1 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... Yes 6.0 None of them No No No No No No No No Yes Yes No No No No No No Former smoker Never used e-cigarettes in my entire life No White only, Non-Hispanic Age 70 to 74 1.78 95.25 30.13 No No Yes Yes Yes, received tetanus shot but not sure what type No No
2 Alabama Male Very good 0.0 0.0 Within past year (anytime less than 12 months ... No 8.0 6 or more, but not all No No No No No No No No Yes No No Yes No Yes No No Former smoker Never used e-cigarettes in my entire life Yes White only, Non-Hispanic Age 75 to 79 1.85 108.86 31.66 Yes No No Yes No, did not receive any tetanus shot in the pa... No Yes
3 Alabama Female Fair 5.0 0.0 Within past year (anytime less than 12 months ... Yes 9.0 None of them No No No No Yes No Yes No Yes No No No No Yes No No Never smoked Never used e-cigarettes in my entire life No White only, Non-Hispanic Age 80 or older 1.70 90.72 31.32 No No Yes Yes No, did not receive any tetanus shot in the pa... No Yes
4 Alabama Female Good 3.0 15.0 Within past year (anytime less than 12 months ... Yes 5.0 1 to 5 No No No No No No No No Yes No No No No No No No Never smoked Never used e-cigarettes in my entire life No White only, Non-Hispanic Age 80 or older 1.55 79.38 33.07 No No Yes Yes No, did not receive any tetanus shot in the pa... No No

EDA

In [4]:
%matplotlib inline
sns.set_style("darkgrid")

colors6 = sns.color_palette(['#1337f5', '#E80000', '#0f1e41', '#fd523e', '#404e5c', '#c9bbaa'], 6)
colors2 = sns.color_palette(['#1337f5', '#E80000'], 2)
colors1 = sns.color_palette(['#1337f5'], 1)
In [5]:
numeric_vars = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms',
                'BMI']

all_cols = df.columns.tolist()

subtract_set = set(numeric_vars + ['HadHeartAttack'])

categoric_vars =  [col for col in all_cols if col not in subtract_set]
categoric_vars
Out[5]:
['State',
 'Sex',
 'GeneralHealth',
 'LastCheckupTime',
 'PhysicalActivities',
 'RemovedTeeth',
 'HadAngina',
 'HadStroke',
 'HadAsthma',
 'HadSkinCancer',
 'HadCOPD',
 'HadDepressiveDisorder',
 'HadKidneyDisease',
 'HadArthritis',
 'HadDiabetes',
 'DeafOrHardOfHearing',
 'BlindOrVisionDifficulty',
 'DifficultyConcentrating',
 'DifficultyWalking',
 'DifficultyDressingBathing',
 'DifficultyErrands',
 'SmokerStatus',
 'ECigaretteUsage',
 'ChestScan',
 'RaceEthnicityCategory',
 'AgeCategory',
 'AlcoholDrinkers',
 'HIVTesting',
 'FluVaxLast12',
 'PneumoVaxEver',
 'TetanusLast10Tdap',
 'HighRiskLastYear',
 'CovidPos']
In [6]:
for i in categoric_vars:
    print(df[i].value_counts())
    print()
State
Washington              15000
Maryland                 9165
Minnesota                9161
Ohio                     8995
New York                 8923
Texas                    7408
Florida                  7315
Kansas                   6145
Wisconsin                6126
Maine                    6013
Iowa                     5672
Hawaii                   5596
Virginia                 5565
Indiana                  5502
South Carolina           5471
Massachusetts            5465
Arizona                  5462
Utah                     5373
Michigan                 5370
Colorado                 5159
Nebraska                 5107
California               5096
Connecticut              5053
Georgia                  4978
Vermont                  4845
South Dakota             4405
Montana                  4264
Missouri                 4195
New Jersey               3967
New Hampshire            3756
Puerto Rico              3589
Idaho                    3468
Alaska                   3205
Rhode Island             3112
Oregon                   3049
Louisiana                3010
West Virginia            2974
New Mexico               2968
Oklahoma                 2941
Arkansas                 2940
Pennsylvania             2729
Tennessee                2725
Illinois                 2607
North Carolina           2551
North Dakota             2498
Mississippi              2438
Kentucky                 2413
Wyoming                  2410
Delaware                 2155
Alabama                  1902
Nevada                   1769
District of Columbia     1725
Guam                     1549
Virgin Islands            743
Name: count, dtype: int64

Sex
Female    127811
Male      118211
Name: count, dtype: int64

GeneralHealth
Very good    86999
Good         77409
Excellent    41525
Fair         30659
Poor          9430
Name: count, dtype: int64

LastCheckupTime
Within past year (anytime less than 12 months ago)         198153
Within past 2 years (1 year but less than 2 years ago)      23227
Within past 5 years (2 years but less than 5 years ago)     13744
5 or more years ago                                         10898
Name: count, dtype: int64

PhysicalActivities
Yes    191318
No      54704
Name: count, dtype: int64

RemovedTeeth
None of them              131592
1 to 5                     74702
6 or more, but not all     25950
All                        13778
Name: count, dtype: int64

HadAngina
No     231069
Yes     14953
Name: count, dtype: int64

HadStroke
No     235910
Yes     10112
Name: count, dtype: int64

HadAsthma
No     209493
Yes     36529
Name: count, dtype: int64

HadSkinCancer
No     225001
Yes     21021
Name: count, dtype: int64

HadCOPD
No     227028
Yes     18994
Name: count, dtype: int64

HadDepressiveDisorder
No     195402
Yes     50620
Name: count, dtype: int64

HadKidneyDisease
No     234738
Yes     11284
Name: count, dtype: int64

HadArthritis
No     161139
Yes     84883
Name: count, dtype: int64

HadDiabetes
No                                         204834
Yes                                         33813
No, pre-diabetes or borderline diabetes      5392
Yes, but only during pregnancy (female)      1983
Name: count, dtype: int64

DeafOrHardOfHearing
No     224990
Yes     21032
Name: count, dtype: int64

BlindOrVisionDifficulty
No     233796
Yes     12226
Name: count, dtype: int64

DifficultyConcentrating
No     219802
Yes     26220
Name: count, dtype: int64

DifficultyWalking
No     209952
Yes     36070
Name: count, dtype: int64

DifficultyDressingBathing
No     237682
Yes      8340
Name: count, dtype: int64

DifficultyErrands
No     229638
Yes     16384
Name: count, dtype: int64

SmokerStatus
Never smoked                             147737
Former smoker                             68527
Current smoker - now smokes every day     21659
Current smoker - now smokes some days      8099
Name: count, dtype: int64

ECigaretteUsage
Never used e-cigarettes in my entire life    190128
Not at all (right now)                        43281
Use them some days                             6658
Use them every day                             5955
Name: count, dtype: int64

ChestScan
No     141822
Yes    104200
Name: count, dtype: int64

RaceEthnicityCategory
White only, Non-Hispanic         186336
Hispanic                          22570
Black only, Non-Hispanic          19330
Other race only, Non-Hispanic     12205
Multiracial, Non-Hispanic          5581
Name: count, dtype: int64

AgeCategory
Age 65 to 69       28557
Age 60 to 64       26720
Age 70 to 74       25739
Age 55 to 59       22224
Age 50 to 54       19913
Age 75 to 79       18136
Age 80 or older    17816
Age 40 to 44       16973
Age 45 to 49       16753
Age 35 to 39       15614
Age 30 to 34       13346
Age 18 to 24       13122
Age 25 to 29       11109
Name: count, dtype: int64

AlcoholDrinkers
Yes    135307
No     110715
Name: count, dtype: int64

HIVTesting
No     161520
Yes     84502
Name: count, dtype: int64

FluVaxLast12
Yes    131196
No     114826
Name: count, dtype: int64

PneumoVaxEver
No     146130
Yes     99892
Name: count, dtype: int64

TetanusLast10Tdap
No, did not receive any tetanus shot in the past 10 years    81747
Yes, received tetanus shot but not sure what type            74119
Yes, received Tdap                                           70286
Yes, received tetanus shot, but not Tdap                     19870
Name: count, dtype: int64

HighRiskLastYear
No     235446
Yes     10576
Name: count, dtype: int64

CovidPos
No                                                               167306
Yes                                                               70324
Tested positive using home test without a health professional      8392
Name: count, dtype: int64

In [7]:
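def show_relation(col, according_to, type_='dis'):
    """Plot `col`, optionally split by `according_to`.

    type_='dis' draws a KDE; type_='count' draws per-category shares
    (or raw counts when `according_to` is None).
    """
    if type_ == 'dis':
        # displot is figure-level and creates its own figure, so size it here
        # instead of opening an empty plt.figure() first
        sns.displot(data=df, x=col, hue=according_to, kind='kde',
                    palette=colors2, height=7, aspect=15 / 7)
    elif type_ == 'count':
        plt.figure(figsize=(15, 7))
        order = df[col].value_counts().index
        if according_to is not None:
            # share of each `according_to` value within every category of `col`
            perc = (df.groupby(col)[according_to]
                      .value_counts(normalize=True)
                      .reset_index(name='Percentage'))
            sns.barplot(data=perc, x=col, y='Percentage', hue=according_to,
                        palette=colors6, order=order)
        else:
            sns.countplot(data=df, x=col, color=colors1[0], order=order)

    if according_to is None:
        plt.title(f'{col}')
    else:
        plt.title(f'{col} according to {according_to}')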
In [8]:
def generate_colors(num):
    colors = []
    lst = list('ABCDEF0123456789')

    for i in range(num):
        colors.append('#'+''.join(np.random.choice(lst, 6)))
        
    return colors
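
`generate_colors` draws from NumPy's global random state, so the palette changes on every run. If stable colors are wanted, a seeded variant can be sketched with the standard library. The name `generate_colors_seeded` and the `seed` parameter are illustrative and not part of the original notebook.

```python
import random

HEX_DIGITS = "0123456789ABCDEF"

def generate_colors_seeded(num, seed=0):
    """Return `num` random hex color strings, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; the global state is untouched
    return ["#" + "".join(rng.choice(HEX_DIGITS) for _ in range(6))
            for _ in range(num)]

palette = generate_colors_seeded(3, seed=42)
print(palette)
```

Calling the function twice with the same `seed` returns the same list, which keeps figures comparable between runs.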
In [9]:
plt.figure(figsize=(15,7));
plt.title('HadHeartAttack Count');
sns.countplot(data=df, x='HadHeartAttack', palette=colors2, order=df['HadHeartAttack'].value_counts().index);

The target is heavily imbalanced: far more respondents answered No than Yes for HadHeartAttack.

In [10]:
# percentage of respondents in each HadHeartAttack class, as a dictionary
disease_size = (df.groupby('HadHeartAttack').size()*100 / len(df)).to_dict()

# create figure
fig = plt.figure(
    FigureClass=Waffle, # draw a waffle chart
    rows=5, # number of icon rows
    figsize = (9,3),
    values=disease_size, # data

    # legend labels, e.g. "No (94.54%)"
    labels=[f"{k} ({round(v / sum(disease_size.values()) * 100, 2)}%)" 
            for k, v in disease_size.items()],
    # one color per class (No, Yes)
    colors=(colors2[0], colors2[1]),
    # heart icon for both classes
    icons = ['heart','heart'],
    # legend centered below the chart
    legend={'loc': 'lower center',
            'bbox_to_anchor': (0.5, -0.5),
            'ncol': len(disease_size),
            'framealpha': 0,
            'fontsize': 20
          },

    # size of the icons
    icon_size=20,

    # show the icons in the legend as well
    icon_legend=True,

    # title of the waffle chart
    title={
        'label': 'Heart Attack Per 100 People',
        'loc': 'center',
        'fontdict': {'fontsize': 20}
          }
)
In [11]:
disease_size
Out[11]:
{'No': 94.53910625878986, 'Yes': 5.4608937412101355}
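
With a split like this, plain accuracy is a weak yardstick. A minimal sketch, assuming a hypothetical baseline that answers "No" for every respondent, shows why: it reaches high accuracy while finding none of the heart attack cases. The class shares are copied from the output above and the row count from `df.info()`; the baseline itself is not a model used in this notebook.

```python
# Class shares in percent, copied from Out[11]; total rows from df.info()
shares = {"No": 94.53910625878986, "Yes": 5.4608937412101355}
n_total = 246022

n_yes = round(shares["Yes"] / 100 * n_total)  # respondents with a heart attack
n_no = n_total - n_yes

# Hypothetical baseline: predict "No" for every respondent
tp, fp = 0, 0          # nobody is ever predicted "Yes"
fn, tn = n_yes, n_no   # every "Yes" is missed, every "No" is correct

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)  # share of actual "Yes" cases that were found

print(f"accuracy = {accuracy:.4f}, recall = {recall:.4f}")
```

This is the reason metrics such as recall, precision, F1, and ROC AUC (all imported in the first cell) are more informative here than accuracy alone.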
In [12]:
df.hist(figsize=(16, 12), bins=50, color=colors1);
plt.suptitle("Distribution of Numerical Values");
In [13]:
obj_cols = df.select_dtypes(include='object').columns[1:]
num_cols = df.select_dtypes(exclude='object').columns
print(f'Object columns : {obj_cols}', end='\n\n')
print(f'Numerical columns : {num_cols}')
Object columns : Index(['Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities',
       'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
       'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease',
       'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing',
       'BlindOrVisionDifficulty', 'DifficultyConcentrating',
       'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands',
       'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory',
       'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
       'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos'],
      dtype='object')

Numerical columns : Index(['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
       'HeightInMeters', 'WeightInKilograms', 'BMI'],
      dtype='object')
In [14]:
plt.figure(figsize=(20, 60))

for i in range(len(obj_cols)):
  plt.subplot(17, 2, i+1)

  if(df[obj_cols[i]].nunique() < 3):
    ax = sns.countplot(data=df, x=obj_cols[i], palette=colors2, order=df[obj_cols[i]].value_counts().index[:6])
  else:
    ax = sns.countplot(data=df, x=obj_cols[i], palette=colors6, order=df[obj_cols[i]].value_counts().index[:6])

  
  plt.title(f'{obj_cols[i]}', fontsize=15, fontweight='bold', color='brown')
  plt.subplots_adjust(hspace=0.5)

  for p in ax.patches:
    height = p.get_height() 
    width = p.get_width()
    percent = height/len(df)

    ax.text(x=p.get_x()+width/2, y=height+2, s=format(percent, ".2%"), fontsize=12, ha='center', weight='bold')

Many of the categorical features are imbalanced as well.

In [15]:
ax.patches
Out[15]:
<Axes.ArtistList of 3 patches>
In [16]:
round(df[(df['HadHeartAttack'] == 'Yes')]['Sex'].value_counts() / df.shape[0] * 100, 2)
Out[16]:
Sex
Male      3.46
Female    2.00
Name: count, dtype: float64
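
The percentages in this cell are joint shares: heart attack cases of each sex divided by all respondents (about 3.46% for men and 2.00% for women). They are not the heart attack rate within each sex. As a rough sketch, the within-group rate can be recovered by rescaling with the group sizes printed earlier; the variable names below are illustrative, and the inputs are rounded, so the results are approximate.

```python
n_total = 246022                                  # rows in df
group_size = {"Male": 118211, "Female": 127811}   # Sex counts printed by cell In [6]
joint_pct = {"Male": 3.46, "Female": 2.00}        # rounded joint shares from Out[16]

# rate within a group = (joint share * all respondents) / respondents in that group
within_group_rate = {
    sex: joint_pct[sex] / 100 * n_total / group_size[sex]
    for sex in group_size
}

for sex, rate in within_group_rate.items():
    print(f"{sex}: {rate:.2%} of this group reported a heart attack")
```

On the full DataFrame the same quantity could be read directly from `df.groupby('Sex')['HadHeartAttack'].value_counts(normalize=True)`, which is the pattern the notebook applies to `AgeCategory` later on.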
In [17]:
df.shape
Out[17]:
(246022, 40)

Most respondents in the data are White, non-Hispanic (about 76%) and do not have diabetes (about 83%).

In the pie charts, the left column shows the distribution over all respondents, regardless of HadHeartAttack. The right column shows the distribution among respondents with HadHeartAttack equal to Yes.

In [18]:
obj_cols
Out[18]:
Index(['Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities',
       'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
       'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease',
       'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing',
       'BlindOrVisionDifficulty', 'DifficultyConcentrating',
       'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands',
       'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory',
       'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
       'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos'],
      dtype='object')
In [19]:
for col in obj_cols:
    fig, ax = plt.subplots(1, 2, figsize=(20, 20))
    # left: share of each category among all respondents
    (df[col].value_counts(normalize=True) * 100).round(2).plot.pie(
        autopct="%1.2f%%", ax=ax[0], textprops={"color": "white"}, colors=colors6, radius=0.9)
    # right: share of each category among respondents who had a heart attack
    (df.loc[df['HadHeartAttack'] == 'Yes', col].value_counts(normalize=True) * 100).round(2).plot.pie(
        autopct="%1.2f%%", ax=ax[1], textprops={"color": "white"}, colors=colors6, radius=0.9)
    plt.legend(loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
    plt.title(f'{col}', fontsize=15, fontweight='bold', color='brown')
    plt.tight_layout()  # inside the loop, so it applies to the figure being shown
    plt.show()
In [21]:
num_cols[5]
Out[21]:
'BMI'
In [22]:
show_relation(num_cols[5], 'HadHeartAttack');
In [23]:
plt.figure(figsize=(16, 6), dpi=80)

sns.boxplot(data=df, x='BMI', y='HadHeartAttack', saturation=0.4, 
            width=0.15, boxprops={'zorder': 2},
            showfliers = False, whis=0,  palette=colors2);
sns.violinplot(data=df, x='BMI', y='HadHeartAttack',inner='quartile', palette=colors2);

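The BMI distributions of the two groups largely overlap, so BMI on its own does not appear to separate respondents with and without a heart attack.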
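
One way to put a number on how much two distributions overlap is a standardized mean difference (Cohen's d): the gap between the group means divided by the pooled standard deviation. The sketch below uses made-up BMI values purely to show the calculation. They are not taken from the dataset, and `cohens_d` is a helper written for this illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference between samples a and b (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Toy values for illustration only (NOT from heart_2022_no_nans.csv)
bmi_heart_attack = [27.5, 31.0, 29.4, 33.1, 26.8]
bmi_no_heart_attack = [26.9, 30.2, 28.7, 32.5, 25.4]

d = cohens_d(bmi_heart_attack, bmi_no_heart_attack)
print(f"Cohen's d = {d:.2f}")
```

On the real data the two inputs would be `df.loc[df.HadHeartAttack == 'Yes', 'BMI']` and its `'No'` counterpart. By common convention, values of d near 0.2 are read as small and values near 0.8 as large.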

In [24]:
show_relation(num_cols[1], 'HadHeartAttack')
In [25]:
obj_cols[25]
Out[25]:
'AgeCategory'
In [26]:
df.groupby('AgeCategory')['HadHeartAttack'].value_counts(normalize=True).reset_index(name='Percentage')
Out[26]:
    AgeCategory      HadHeartAttack  Percentage
 0  Age 18 to 24     No              0.996190
 1  Age 18 to 24     Yes             0.003810
 2  Age 25 to 29     No              0.995769
 3  Age 25 to 29     Yes             0.004231
 4  Age 30 to 34     No              0.993256
 5  Age 30 to 34     Yes             0.006744
 6  Age 35 to 39     No              0.990009
 7  Age 35 to 39     Yes             0.009991
 8  Age 40 to 44     No              0.986567
 9  Age 40 to 44     Yes             0.013433
10  Age 45 to 49     No              0.974930
11  Age 45 to 49     Yes             0.025070
12  Age 50 to 54     No              0.964696
13  Age 50 to 54     Yes             0.035304
14  Age 55 to 59     No              0.949964
15  Age 55 to 59     Yes             0.050036
16  Age 60 to 64     No              0.941055
17  Age 60 to 64     Yes             0.058945
18  Age 65 to 69     No              0.924537
19  Age 65 to 69     Yes             0.075463
20  Age 70 to 74     No              0.906445
21  Age 70 to 74     Yes             0.093555
22  Age 75 to 79     No              0.886138
23  Age 75 to 79     Yes             0.113862
24  Age 80 or older  No              0.863830
25  Age 80 or older  Yes             0.136170
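
The table reads more easily as a risk ratio against the youngest group. The sketch below copies the "Yes" rates from the output above and divides each by the rate for ages 18 to 24. It describes an association in survey data, not a causal effect of age.

```python
# Heart attack rate per age group, copied from the 'Yes' rows of Out[26]
yes_rate = {
    "Age 18 to 24": 0.003810,
    "Age 25 to 29": 0.004231,
    "Age 30 to 34": 0.006744,
    "Age 35 to 39": 0.009991,
    "Age 40 to 44": 0.013433,
    "Age 45 to 49": 0.025070,
    "Age 50 to 54": 0.035304,
    "Age 55 to 59": 0.050036,
    "Age 60 to 64": 0.058945,
    "Age 65 to 69": 0.075463,
    "Age 70 to 74": 0.093555,
    "Age 75 to 79": 0.113862,
    "Age 80 or older": 0.136170,
}

baseline = yes_rate["Age 18 to 24"]
risk_ratio = {age: rate / baseline for age, rate in yes_rate.items()}

rates = list(yes_rate.values())  # dicts keep insertion order, youngest first
is_increasing = all(a < b for a, b in zip(rates, rates[1:]))

for age, rr in risk_ratio.items():
    print(f"{age:<16}{rr:6.1f}x")
print("strictly increasing with age:", is_increasing)
```

The reported rate rises in every successive age band, and the oldest group reports a heart attack roughly 36 times as often as the youngest.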
In [27]:
show_relation(obj_cols[25], 'HadHeartAttack', type_='count')
In [28]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 6))

sns.histplot(data=df.loc[df.HadHeartAttack == 'No'].sort_values("AgeCategory"), x='AgeCategory', ax=ax1);
ax1.set_title("Age Distribution of People Without Heart Attack")

sns.histplot(data=df.loc[df.HadHeartAttack == 'Yes'].sort_values("AgeCategory"), x='AgeCategory',
                  color=colors2[1], ax=ax2);
ax2.set_title("Age Distribution of Heart Attack Patients")


fig.tight_layout()
In [29]:
show_relation(obj_cols[0], 'HadHeartAttack', type_='count')
In [30]:
show_relation(obj_cols[21], 'HadHeartAttack', type_='count')

The plot suggests that respondents who smoke report a heart attack more often than those who have never smoked.

In [31]:
show_relation(obj_cols[22], 'HadHeartAttack', type_='count')
In [32]:
show_relation(obj_cols[23], 'HadHeartAttack', type_='count')
In [33]:
show_relation(obj_cols[24], 'HadHeartAttack', type_='count')
In [34]:
show_relation(obj_cols[26], 'HadHeartAttack', type_='count')
In [35]:
show_relation(obj_cols[3], 'HadHeartAttack', type_='count')
In [36]:
show_relation(obj_cols[4], 'HadHeartAttack', type_='count')

The number of removed teeth likely acts as a proxy for the respondent's age.

In [37]:
show_relation(obj_cols[6], 'HadHeartAttack', type_='count')
In [38]:
show_relation(obj_cols[7], 'HadHeartAttack', type_='count')
In [39]:
show_relation(obj_cols[8], 'HadHeartAttack', type_='count')
In [40]:
show_relation(obj_cols[9], 'HadHeartAttack', type_='count')
In [41]:
show_relation(obj_cols[10], 'HadHeartAttack', type_='count')
In [42]:
show_relation(obj_cols[11], 'HadHeartAttack', type_='count')
In [43]:
show_relation(obj_cols[12], 'HadHeartAttack', type_='count')
In [44]:
show_relation(obj_cols[13], 'HadHeartAttack', type_='count')
In [45]:
show_relation(obj_cols[14], 'HadHeartAttack', type_='count')
In [46]:
show_relation(obj_cols[15], 'HadHeartAttack', type_='count')
In [47]:
fig, ax = plt.subplots(figsize = (14,6))
# fill=False replaces the deprecated shade=False
sns.kdeplot(df[df["HadHeartAttack"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[0], label="HadHeartAttack", ax=ax)
sns.kdeplot(df[df["HadKidneyDisease"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[1], label="HadKidneyDisease", ax=ax)
sns.kdeplot(df[df["HadSkinCancer"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[2], label="HadSkinCancer", ax=ax)
sns.kdeplot(df[df["HadAsthma"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[3], label="HadAsthma", ax=ax)
sns.kdeplot(df[df["HadStroke"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[4], label="HadStroke", ax=ax)
sns.kdeplot(df[df["HadDiabetes"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[5], label="HadDiabetes", ax=ax)


ax.set_xlabel("BMI")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()
In [48]:
fig, ax = plt.subplots(figsize=(14, 6))
conditions = ["HadHeartAttack", "HadKidneyDisease", "HadSkinCancer", "HadAsthma", "HadStroke", "HadDiabetes"]

# One KDE curve of MentalHealthDays per condition, limited to respondents who answered 'Yes'.
# fill=False replaces the deprecated shade=False.
for condition, color in zip(conditions, colors6):
    sns.kdeplot(df.loc[df[condition] == 'Yes', "MentalHealthDays"], alpha=1, fill=False, color=color, label=condition, ax=ax)

ax.set_xlabel("MentalHealthDays")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()
In [49]:
fig, ax = plt.subplots(figsize=(14, 6))
conditions = ["HadHeartAttack", "HadKidneyDisease", "HadSkinCancer", "HadAsthma", "HadStroke", "HadDiabetes"]

# One KDE curve of SleepHours per condition, limited to respondents who answered 'Yes'.
# fill=False replaces the deprecated shade=False.
for condition, color in zip(conditions, colors6):
    sns.kdeplot(df.loc[df[condition] == 'Yes', "SleepHours"], alpha=1, fill=False, color=color, label=condition, ax=ax)

ax.set_xlabel("SleepHours")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()

Feature Engineering

In [50]:
region_mapping = {
    'Connecticut': 'Northeast', 'Maine': 'Northeast', 'Massachusetts': 'Northeast',
    'New Hampshire': 'Northeast', 'New Jersey': 'Northeast', 'New York': 'Northeast',
    'Pennsylvania': 'Northeast', 'Rhode Island': 'Northeast', 'Vermont': 'Northeast',
    'Illinois': 'Midwest', 'Indiana': 'Midwest', 'Iowa': 'Midwest', 'Kansas': 'Midwest',
    'Michigan': 'Midwest', 'Minnesota': 'Midwest', 'Missouri': 'Midwest', 'Nebraska': 'Midwest',
    'North Dakota': 'Midwest', 'Ohio': 'Midwest', 'South Dakota': 'Midwest', 'Wisconsin': 'Midwest',
    'Alabama': 'South', 'Arkansas': 'South', 'Delaware': 'South', 'Florida': 'South',
    'Georgia': 'South', 'Kentucky': 'South', 'Louisiana': 'South', 'Maryland': 'South',
    'Mississippi': 'South', 'North Carolina': 'South', 'Oklahoma': 'South', 'South Carolina': 'South',
    'Tennessee': 'South', 'Texas': 'South', 'Virginia': 'South', 'West Virginia': 'South',
    'Alaska': 'West', 'Arizona': 'West', 'California': 'West', 'Colorado': 'West',
    'Hawaii': 'West', 'Idaho': 'West', 'Montana': 'West', 'Nevada': 'West',
    'New Mexico': 'West', 'Oregon': 'West', 'Utah': 'West', 'Washington': 'West', 'Wyoming': 'West'
}

df['Location'] = df['State'].map(region_mapping)

# Drop the original column
df.drop('State', axis=1, inplace=True)
In [51]:
df['Location'].value_counts()
Out[51]:
Location
South        65951
Midwest      65783
West         62819
Northeast    43863
Name: count, dtype: int64
In [52]:
df['HadHeartAttack'].value_counts()
# Class ratio is roughly 17:1 (No : Yes)
Out[52]:
HadHeartAttack
No     232587
Yes     13435
Name: count, dtype: int64
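
The four region counts in Out[51] sum to 238,416, while the class counts in Out[52] sum to 246,022. That leaves 7,606 rows whose State value has no key in region_mapping, so their Location is NaN. Which values those are cannot be read from the output above. A minimal sketch of how one might list them and keep them under an explicit label follows; the toy frame, the 'Guam' entry, and the 'Other' label are illustrative assumptions, not values taken from the data set.

```python
import pandas as pd

# Toy stand-in for the notebook's df; 'Guam' is a hypothetical unmapped value
toy = pd.DataFrame({"State": ["Ohio", "Texas", "Guam", "Ohio"]})
toy_mapping = {"Ohio": "Midwest", "Texas": "South"}

# Series.map returns NaN for every value that is not a key in the mapping
location = toy["State"].map(toy_mapping)

# List the State values that did not match any key
unmapped_states = sorted(toy.loc[location.isna(), "State"].unique())
print(unmapped_states)

# One option: keep those rows under an explicit label instead of NaN
location_filled = location.fillna("Other")
print(location_filled.value_counts())
```

This matters for the encoding step later on: pd.get_dummies ignores NaN by default, so with drop_first=True an unmapped row gets zeros in every Location_* column and cannot be told apart from the dropped reference region.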

Check correlation matrix for numeric columns

In [53]:
correlation_matrix = df[numeric_vars].corr()

print(correlation_matrix)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
                    PhysicalHealthDays  MentalHealthDays  SleepHours  \
PhysicalHealthDays            1.000000          0.306800   -0.056063   
MentalHealthDays              0.306800          1.000000   -0.130100   
SleepHours                   -0.056063         -0.130100    1.000000   
HeightInMeters               -0.049180         -0.056010   -0.011384   
WeightInKilograms             0.077505          0.042441   -0.054691   
BMI                           0.116905          0.082182   -0.054750   

                    HeightInMeters  WeightInKilograms       BMI  
PhysicalHealthDays       -0.049180           0.077505  0.116905  
MentalHealthDays         -0.056010           0.042441  0.082182  
SleepHours               -0.011384          -0.054691 -0.054750  
HeightInMeters            1.000000           0.473768 -0.026637  
WeightInKilograms         0.473768           1.000000  0.859313  
BMI                      -0.026637           0.859313  1.000000  

WeightInKilograms and BMI are strongly correlated (r = 0.86), so drop 'WeightInKilograms'

In [54]:
numeric_vars = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI']

df.drop('WeightInKilograms', axis=1, inplace=True)

correlation_matrix = df[numeric_vars].corr()

print(correlation_matrix)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
                    PhysicalHealthDays  MentalHealthDays  SleepHours  \
PhysicalHealthDays            1.000000          0.306800   -0.056063   
MentalHealthDays              0.306800          1.000000   -0.130100   
SleepHours                   -0.056063         -0.130100    1.000000   
HeightInMeters               -0.049180         -0.056010   -0.011384   
BMI                           0.116905          0.082182   -0.054750   

                    HeightInMeters       BMI  
PhysicalHealthDays       -0.049180  0.116905  
MentalHealthDays         -0.056010  0.082182  
SleepHours               -0.011384 -0.054750  
HeightInMeters            1.000000 -0.026637  
BMI                      -0.026637  1.000000  

In [55]:
numeric_df = df[numeric_vars]

X = add_constant(numeric_df)

vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns

vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)
             Variable         VIF
0               const  311.539297
1  PhysicalHealthDays    1.115642
2    MentalHealthDays    1.124034
3          SleepHours    1.019833
4      HeightInMeters    1.005062
5                 BMI    1.018593
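
For reading the table above: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other columns. Values close to 1 for the five predictors indicate little linear overlap among them. The large value in the const row belongs to the intercept column and is not a collinearity signal for any predictor. The sketch below recomputes VIF by hand on synthetic data; the variables x1, x2, x3 and the helper vif are hypothetical and only illustrate the formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1


def vif(X, j):
    """VIF of column j: regress it on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


X_toy = np.column_stack([x1, x2, x3])
vifs = [vif(X_toy, j) for j in range(X_toy.shape[1])]
print(vifs)
```

In this toy setup x1 and x3 should show a large VIF because each is almost a linear function of the other, while x2 should stay near 1.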
In [56]:
categoric_vars.remove('State')
categoric_vars.append('HadHeartAttack')
categoric_vars.append('Location')
len(categoric_vars)
Out[56]:
34

Oversampling (due to imbalanced data)

In [65]:
df_encoded = pd.get_dummies(df, columns=categoric_vars, drop_first=True)
In [66]:
pd.set_option('display.max_columns', None)
df_encoded.head(10)
Out[66]:
PhysicalHealthDays MentalHealthDays SleepHours HeightInMeters BMI Sex_Male GeneralHealth_Fair GeneralHealth_Good GeneralHealth_Poor GeneralHealth_Very good LastCheckupTime_Within past 2 years (1 year but less than 2 years ago) LastCheckupTime_Within past 5 years (2 years but less than 5 years ago) LastCheckupTime_Within past year (anytime less than 12 months ago) PhysicalActivities_Yes RemovedTeeth_6 or more, but not all RemovedTeeth_All RemovedTeeth_None of them HadAngina_Yes HadStroke_Yes HadAsthma_Yes HadSkinCancer_Yes HadCOPD_Yes HadDepressiveDisorder_Yes HadKidneyDisease_Yes HadArthritis_Yes HadDiabetes_No, pre-diabetes or borderline diabetes HadDiabetes_Yes HadDiabetes_Yes, but only during pregnancy (female) DeafOrHardOfHearing_Yes BlindOrVisionDifficulty_Yes DifficultyConcentrating_Yes DifficultyWalking_Yes DifficultyDressingBathing_Yes DifficultyErrands_Yes SmokerStatus_Current smoker - now smokes some days SmokerStatus_Former smoker SmokerStatus_Never smoked ECigaretteUsage_Not at all (right now) ECigaretteUsage_Use them every day ECigaretteUsage_Use them some days ChestScan_Yes RaceEthnicityCategory_Hispanic RaceEthnicityCategory_Multiracial, Non-Hispanic RaceEthnicityCategory_Other race only, Non-Hispanic RaceEthnicityCategory_White only, Non-Hispanic AgeCategory_Age 25 to 29 AgeCategory_Age 30 to 34 AgeCategory_Age 35 to 39 AgeCategory_Age 40 to 44 AgeCategory_Age 45 to 49 AgeCategory_Age 50 to 54 AgeCategory_Age 55 to 59 AgeCategory_Age 60 to 64 AgeCategory_Age 65 to 69 AgeCategory_Age 70 to 74 AgeCategory_Age 75 to 79 AgeCategory_Age 80 or older AlcoholDrinkers_Yes HIVTesting_Yes FluVaxLast12_Yes PneumoVaxEver_Yes TetanusLast10Tdap_Yes, received Tdap TetanusLast10Tdap_Yes, received tetanus shot but not sure what type TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap HighRiskLastYear_Yes CovidPos_Tested positive using home test without a health professional CovidPos_Yes HadHeartAttack_Yes Location_Northeast Location_South Location_West
0 4.0 0.0 9.0 1.60 27.99 False False False False True False False True True False False True False False False False False False False True False False False False False False False False False False True False False False False False False False False True False False False False False False False False True False False False False False True True True False False False False False False False True False
1 0.0 0.0 6.0 1.78 30.13 True False False False True False False True True False False True False False False False False False False True False True False False False False False False False False True False False False False False False False False True False False False False False False False False False True False False False False True True False True False False False False False False True False
2 0.0 0.0 8.0 1.85 31.66 True False False False True False False True False True False False False False False False False False False True False False False False True False True False False False True False False False False True False False False True False False False False False False False False False False True False True False False True False False False False False True False False True False
3 5.0 0.0 9.0 1.70 31.32 False True False False False False False True True False False True False False False True False True False True False False False False False False True False False False False True False False False False False False False True False False False False False False False False False False False True False False True True False False False False False True False False True False
4 3.0 15.0 5.0 1.55 33.07 False False True False False False False True True False False False False False False False False False False True False False False False False False False False False False False True False False False False False False False True False False False False False False False False False False False True False False True True False False False False False False False False True False
5 0.0 0.0 7.0 1.85 34.96 True False True False False False False True True False False True False False False False False False False False False False False False False False False False False False False True False False False True False False False True False False False False False True False False False False False False True True True False False True False False False False False False True False
6 3.0 0.0 8.0 1.63 33.30 False False True False False False False True True True False False False True False False False False False False False True False False False False False False False False False True False False False True False False False False False False False False False False False False False False False True False False True True False False False False False False False False True False
7 5.0 0.0 8.0 1.75 24.37 True True False False False False False True True False False False True False False True False False False True False True False False False False False False False False False True False False False True False False False True False False False False False False False False False False True False False True True True False False False False False True True False True False
8 2.0 0.0 6.0 1.70 26.94 True False True False False False False False False False False True False False False False False False False True False False False True False False False False False False True False False False False True False False False True False False False True False False False False False False False False False False False False False False False False False True False False True False
9 0.0 0.0 7.0 1.68 22.60 False False False False True False False True True False False True False False True True False False False True False False False False False False False False False False True False False False False True False False False True False False False False False False False False False False True False False False True True False False False False False False False False True False
In [67]:
pd.set_option('display.max_rows', None)
df_encoded.dtypes
Out[67]:
PhysicalHealthDays                                                         float64
MentalHealthDays                                                           float64
SleepHours                                                                 float64
HeightInMeters                                                             float64
BMI                                                                        float64
Sex_Male                                                                      bool
GeneralHealth_Fair                                                            bool
GeneralHealth_Good                                                            bool
GeneralHealth_Poor                                                            bool
GeneralHealth_Very good                                                       bool
LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)        bool
LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)       bool
LastCheckupTime_Within past year (anytime less than 12 months ago)            bool
PhysicalActivities_Yes                                                        bool
RemovedTeeth_6 or more, but not all                                           bool
RemovedTeeth_All                                                              bool
RemovedTeeth_None of them                                                     bool
HadAngina_Yes                                                                 bool
HadStroke_Yes                                                                 bool
HadAsthma_Yes                                                                 bool
HadSkinCancer_Yes                                                             bool
HadCOPD_Yes                                                                   bool
HadDepressiveDisorder_Yes                                                     bool
HadKidneyDisease_Yes                                                          bool
HadArthritis_Yes                                                              bool
HadDiabetes_No, pre-diabetes or borderline diabetes                           bool
HadDiabetes_Yes                                                               bool
HadDiabetes_Yes, but only during pregnancy (female)                           bool
DeafOrHardOfHearing_Yes                                                       bool
BlindOrVisionDifficulty_Yes                                                   bool
DifficultyConcentrating_Yes                                                   bool
DifficultyWalking_Yes                                                         bool
DifficultyDressingBathing_Yes                                                 bool
DifficultyErrands_Yes                                                         bool
SmokerStatus_Current smoker - now smokes some days                            bool
SmokerStatus_Former smoker                                                    bool
SmokerStatus_Never smoked                                                     bool
ECigaretteUsage_Not at all (right now)                                        bool
ECigaretteUsage_Use them every day                                            bool
ECigaretteUsage_Use them some days                                            bool
ChestScan_Yes                                                                 bool
RaceEthnicityCategory_Hispanic                                                bool
RaceEthnicityCategory_Multiracial, Non-Hispanic                               bool
RaceEthnicityCategory_Other race only, Non-Hispanic                           bool
RaceEthnicityCategory_White only, Non-Hispanic                                bool
AgeCategory_Age 25 to 29                                                      bool
AgeCategory_Age 30 to 34                                                      bool
AgeCategory_Age 35 to 39                                                      bool
AgeCategory_Age 40 to 44                                                      bool
AgeCategory_Age 45 to 49                                                      bool
AgeCategory_Age 50 to 54                                                      bool
AgeCategory_Age 55 to 59                                                      bool
AgeCategory_Age 60 to 64                                                      bool
AgeCategory_Age 65 to 69                                                      bool
AgeCategory_Age 70 to 74                                                      bool
AgeCategory_Age 75 to 79                                                      bool
AgeCategory_Age 80 or older                                                   bool
AlcoholDrinkers_Yes                                                           bool
HIVTesting_Yes                                                                bool
FluVaxLast12_Yes                                                              bool
PneumoVaxEver_Yes                                                             bool
TetanusLast10Tdap_Yes, received Tdap                                          bool
TetanusLast10Tdap_Yes, received tetanus shot but not sure what type           bool
TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap                    bool
HighRiskLastYear_Yes                                                          bool
CovidPos_Tested positive using home test without a health professional        bool
CovidPos_Yes                                                                  bool
HadHeartAttack_Yes                                                            bool
Location_Northeast                                                            bool
Location_South                                                                bool
Location_West                                                                 bool
dtype: object
In [68]:
df_encoded.shape
Out[68]:
(246022, 71)
In [69]:
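# Note: astype(int) converts the boolean dummies to 0/1, but it also truncates the float
# columns (e.g. HeightInMeters 1.60 -> 1, BMI 27.99 -> 27), as the output below shows
df_encoded = df_encoded.astype(int)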
pd.set_option('display.max_columns', None)
df_encoded.head(10)
Out[69]:
PhysicalHealthDays MentalHealthDays SleepHours HeightInMeters BMI Sex_Male GeneralHealth_Fair GeneralHealth_Good GeneralHealth_Poor GeneralHealth_Very good LastCheckupTime_Within past 2 years (1 year but less than 2 years ago) LastCheckupTime_Within past 5 years (2 years but less than 5 years ago) LastCheckupTime_Within past year (anytime less than 12 months ago) PhysicalActivities_Yes RemovedTeeth_6 or more, but not all RemovedTeeth_All RemovedTeeth_None of them HadAngina_Yes HadStroke_Yes HadAsthma_Yes HadSkinCancer_Yes HadCOPD_Yes HadDepressiveDisorder_Yes HadKidneyDisease_Yes HadArthritis_Yes HadDiabetes_No, pre-diabetes or borderline diabetes HadDiabetes_Yes HadDiabetes_Yes, but only during pregnancy (female) DeafOrHardOfHearing_Yes BlindOrVisionDifficulty_Yes DifficultyConcentrating_Yes DifficultyWalking_Yes DifficultyDressingBathing_Yes DifficultyErrands_Yes SmokerStatus_Current smoker - now smokes some days SmokerStatus_Former smoker SmokerStatus_Never smoked ECigaretteUsage_Not at all (right now) ECigaretteUsage_Use them every day ECigaretteUsage_Use them some days ChestScan_Yes RaceEthnicityCategory_Hispanic RaceEthnicityCategory_Multiracial, Non-Hispanic RaceEthnicityCategory_Other race only, Non-Hispanic RaceEthnicityCategory_White only, Non-Hispanic AgeCategory_Age 25 to 29 AgeCategory_Age 30 to 34 AgeCategory_Age 35 to 39 AgeCategory_Age 40 to 44 AgeCategory_Age 45 to 49 AgeCategory_Age 50 to 54 AgeCategory_Age 55 to 59 AgeCategory_Age 60 to 64 AgeCategory_Age 65 to 69 AgeCategory_Age 70 to 74 AgeCategory_Age 75 to 79 AgeCategory_Age 80 or older AlcoholDrinkers_Yes HIVTesting_Yes FluVaxLast12_Yes PneumoVaxEver_Yes TetanusLast10Tdap_Yes, received Tdap TetanusLast10Tdap_Yes, received tetanus shot but not sure what type TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap HighRiskLastYear_Yes CovidPos_Tested positive using home test without a health professional CovidPos_Yes HadHeartAttack_Yes Location_Northeast Location_South Location_West
0 4 0 9 1 27 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0
1 0 0 6 1 30 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0
2 0 0 8 1 31 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0
3 5 0 9 1 31 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0
4 3 15 5 1 33 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0
5 0 0 7 1 34 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0
6 3 0 8 1 33 0 0 1 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0
7 5 0 8 1 24 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0
8 2 0 6 1 26 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
9 0 0 7 1 22 0 0 0 0 1 0 0 1 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0
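
Comparing Out[66] with Out[69] shows that the blanket integer cast also truncated the continuous columns: HeightInMeters collapses to 1 in all ten rows and BMI loses its decimals. A minimal sketch of a selective cast that touches only the boolean dummy columns is given below. The toy frame is an assumption made for illustration, and switching the notebook to this approach would presumably change the model metrics reported afterwards, since those were computed on the truncated values.

```python
import pandas as pd

# Toy frame with two float features and one boolean dummy (illustrative values)
toy = pd.DataFrame({
    "HeightInMeters": [1.60, 1.78],
    "BMI": [27.99, 30.13],
    "Sex_Male": [False, True],
})

# Blanket cast: floats are truncated toward zero
truncated = toy.astype(int)
print(truncated)

# Selective cast: only the boolean columns become 0/1, floats are left alone
bool_cols = toy.select_dtypes(include="bool").columns
selective = toy.astype({col: int for col in bool_cols})
print(selective)
```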

Feature selection

sm.Logit model

In [70]:
X = df_encoded.drop('HadHeartAttack_Yes', axis=1)
y = df_encoded['HadHeartAttack_Yes']
In [71]:
from interpret.glassbox import LogisticRegression

X_train , X_test, y_train,  y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler= StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg_ss=LogisticRegression(solver="liblinear", penalty="l2", C=0.00001, max_iter=10000)
logreg_ss.fit(X_train_scaled,y_train)
y_pred_log=logreg_ss.predict(X_test_scaled)

y_pred_proba_log = logreg_ss.predict_proba(X_test_scaled)
fpr_log, tpr_log, _ = metrics.roc_curve(y_test, y_pred_proba_log[:,1])
auc_log = round(metrics.auc(fpr_log, tpr_log),5)


simple_log = pd.DataFrame(
    data=[accuracy_score(y_test, y_pred_log),
          precision_score(y_test, y_pred_log, average='binary'),
          recall_score(y_test, y_pred_log, average='binary'),
          f1_score(y_test, y_pred_log, average='binary'),
          roc_auc_score(y_test, y_pred_proba_log[:, 1])],
    index=['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC'],
    columns=["Logistic_regression_simple"])

simple_log
Out[71]:
Logistic_regression_simple
Accuracy 0.946550
Precision 0.531425
Recall 0.345368
F1-score 0.418656
AUC 0.889810
In [72]:
plt.figure(figsize = (10,5))
ax=sns.countplot(data=df_encoded , x = 'HadHeartAttack_Yes')
for container in ax.containers:
    ax.bar_label(container, label_type='center', rotation=0, color='white')
plt.title("Distribution before resampling", size=16)    
plt.show()
In [73]:
pip install -U imbalanced-learn

SMOTE¶

In [74]:
from imblearn.over_sampling import SMOTE

# n_jobs is deprecated in imbalanced-learn's SMOTE, so only random_state is set
smote = SMOTE(random_state=0)
X_smote, y_smote = smote.fit_resample(X, y)


plt.figure(figsize=(10, 6))
ax = sns.countplot(x=y_smote)
for container in ax.containers:
    ax.bar_label(container, label_type='center', rotation=0, color='white')
plt.title("Distribution after SMOTE", size=14)
plt.show()
In [75]:
X_train , X_test, y_train,  y_test = train_test_split(X_smote, y_smote , test_size=0.2, random_state=0)

scaler= StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg_s=LogisticRegression(solver="liblinear", penalty="l2", C=0.00001, max_iter=10000)
logreg_s.fit(X_train_scaled,y_train)
y_pred_log=logreg_s.predict(X_test_scaled)

y_pred_proba_log = logreg_s.predict_proba(X_test_scaled)
fpr_log, tpr_log, _ = metrics.roc_curve(y_test, y_pred_proba_log[:,1])
auc_log = round(metrics.auc(fpr_log, tpr_log),5)


SMOTE_log = pd.DataFrame(
    data=[accuracy_score(y_test, y_pred_log),
          precision_score(y_test, y_pred_log, average='binary'),
          recall_score(y_test, y_pred_log, average='binary'),
          f1_score(y_test, y_pred_log, average='binary'),
          roc_auc_score(y_test, y_pred_proba_log[:, 1])],
    index=['Accuracy', 'Precision', 'Recall', 'F1-score', 'AUC'],
    columns=["Logistic_regression_smote"])

SMOTE_log
Out[75]:
Logistic_regression_smote
Accuracy 0.885312
Precision 0.874661
Recall 0.898344
F1-score 0.886344
AUC 0.953757
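
One caveat about the scores above: SMOTE was applied to the full data before `train_test_split`, so the test set contains synthetic minority rows and its class balance no longer matches the raw data. A minimal sketch of the alternative order (split first, oversample only the training fold) follows. The synthetic data from `make_classification` and the hand-written `smote_like` helper are illustrative assumptions; the helper is a simplified stand-in for imblearn's `SMOTE`, not its implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a sampled
    minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is the query point itself, so keep columns 1..k.
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical imbalanced data: roughly 5% positives.
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95],
                           random_state=0)

# 1) Split first, so the test fold keeps the real class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Oversample the minority class in the training fold only.
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_like(X_min, n_new)
X_train_bal = np.vstack([X_train, X_syn])
y_train_bal = np.concatenate([y_train, np.ones(n_new, dtype=y_train.dtype)])

print("train positives share:", y_train_bal.mean())
print("test positives share: ", y_test.mean())
```

With imbalanced-learn itself, the same order is obtained by calling `smote.fit_resample` on `X_train, y_train` only and leaving `X_test, y_test` untouched. Metrics measured this way are usually lower than those on a resampled test set, but they reflect the class balance the model would face in practice.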

Backward elimination: keep features with p-value < 0.05¶

In [76]:
X_smote_const = sm.add_constant(X_smote)

def backward_elimination(X, y, significance_level=0.05):
    """Refit sm.Logit, dropping the least significant feature each round,
    until every remaining p-value is at or below significance_level."""
    while True:
        model = sm.Logit(y, X).fit(disp=0)
        p_values = model.pvalues.iloc[1:]  # exclude the constant term
        if p_values.empty or p_values.max() <= significance_level:
            break
        X = X.drop(columns=[p_values.idxmax()])
    return model, X.columns.tolist()

final_logit_model, final_features = backward_elimination(X_smote_const, y_smote)

print(final_logit_model.summary())
print("Final features selected:", final_features)
                           Logit Regression Results                           
==============================================================================
Dep. Variable:     HadHeartAttack_Yes   No. Observations:               465174
Model:                          Logit   Df Residuals:                   465103
Method:                           MLE   Df Model:                           70
Date:                Tue, 05 Mar 2024   Pseudo R-squ.:                  0.6838
Time:                        17:42:13   Log-Likelihood:            -1.0196e+05
converged:                       True   LL-Null:                   -3.2243e+05
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================================================================
                                                                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------------------------------------
const                                                                       9.6304      0.222     43.434      0.000       9.196      10.065
PhysicalHealthDays                                                          0.0138      0.001     16.760      0.000       0.012       0.015
MentalHealthDays                                                           -0.0048      0.001     -5.019      0.000      -0.007      -0.003
SleepHours                                                                 -0.1309      0.004    -32.147      0.000      -0.139      -0.123
HeightInMeters                                                             -1.1759      0.214     -5.503      0.000      -1.595      -0.757
BMI                                                                         0.0186      0.001     17.665      0.000       0.017       0.021
Sex_Male                                                                    0.3892      0.012     31.881      0.000       0.365       0.413
GeneralHealth_Fair                                                         -0.9913      0.022    -45.644      0.000      -1.034      -0.949
GeneralHealth_Good                                                         -1.0248      0.016    -64.260      0.000      -1.056      -0.994
GeneralHealth_Poor                                                         -0.4666      0.035    -13.271      0.000      -0.536      -0.398
GeneralHealth_Very good                                                    -1.2513      0.016    -76.257      0.000      -1.283      -1.219
LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)     -2.2381      0.043    -51.822      0.000      -2.323      -2.153
LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)    -2.6071      0.060    -43.246      0.000      -2.725      -2.489
LastCheckupTime_Within past year (anytime less than 12 months ago)         -0.4060      0.025    -16.032      0.000      -0.456      -0.356
PhysicalActivities_Yes                                                     -0.4414      0.013    -32.817      0.000      -0.468      -0.415
RemovedTeeth_6 or more, but not all                                        -1.0589      0.019    -56.007      0.000      -1.096      -1.022
RemovedTeeth_All                                                           -0.9856      0.023    -42.520      0.000      -1.031      -0.940
RemovedTeeth_None of them                                                  -1.2603      0.014    -92.827      0.000      -1.287      -1.234
HadAngina_Yes                                                               2.5544      0.019    137.029      0.000       2.518       2.591
HadStroke_Yes                                                               0.4309      0.026     16.372      0.000       0.379       0.483
HadAsthma_Yes                                                              -0.6426      0.022    -29.758      0.000      -0.685      -0.600
HadSkinCancer_Yes                                                          -0.6174      0.021    -29.043      0.000      -0.659      -0.576
HadCOPD_Yes                                                                -0.3713      0.023    -16.373      0.000      -0.416      -0.327
HadDepressiveDisorder_Yes                                                  -0.4059      0.021    -19.621      0.000      -0.446      -0.365
HadKidneyDisease_Yes                                                       -0.4088      0.028    -14.713      0.000      -0.463      -0.354
HadArthritis_Yes                                                           -0.1141      0.013     -8.924      0.000      -0.139      -0.089
HadDiabetes_No, pre-diabetes or borderline diabetes                        -1.3309      0.061    -21.840      0.000      -1.450      -1.212
HadDiabetes_Yes                                                            -0.1030      0.016     -6.328      0.000      -0.135      -0.071
HadDiabetes_Yes, but only during pregnancy (female)                        -1.0337      0.148     -6.963      0.000      -1.325      -0.743
DeafOrHardOfHearing_Yes                                                    -0.3241      0.020    -16.033      0.000      -0.364      -0.285
BlindOrVisionDifficulty_Yes                                                -0.3345      0.031    -10.825      0.000      -0.395      -0.274
DifficultyConcentrating_Yes                                                -0.3266      0.026    -12.693      0.000      -0.377      -0.276
DifficultyWalking_Yes                                                      -0.2507      0.019    -13.217      0.000      -0.288      -0.213
DifficultyDressingBathing_Yes                                              -0.2140      0.038     -5.573      0.000      -0.289      -0.139
DifficultyErrands_Yes                                                      -0.2384      0.029     -8.199      0.000      -0.295      -0.181
SmokerStatus_Current smoker - now smokes some days                         -2.0039      0.049    -40.880      0.000      -2.100      -1.908
SmokerStatus_Former smoker                                                 -1.4353      0.019    -77.114      0.000      -1.472      -1.399
SmokerStatus_Never smoked                                                  -2.1291      0.018   -116.406      0.000      -2.165      -2.093
ECigaretteUsage_Not at all (right now)                                     -1.1767      0.020    -59.381      0.000      -1.215      -1.138
ECigaretteUsage_Use them every day                                         -2.3336      0.083    -28.109      0.000      -2.496      -2.171
ECigaretteUsage_Use them some days                                         -2.0157      0.069    -29.004      0.000      -2.152      -1.879
ChestScan_Yes                                                               0.6380      0.012     52.482      0.000       0.614       0.662
RaceEthnicityCategory_Hispanic                                             -1.8906      0.034    -54.990      0.000      -1.958      -1.823
RaceEthnicityCategory_Multiracial, Non-Hispanic                            -1.7094      0.063    -26.933      0.000      -1.834      -1.585
RaceEthnicityCategory_Other race only, Non-Hispanic                        -1.9957      0.044    -45.742      0.000      -2.081      -1.910
RaceEthnicityCategory_White only, Non-Hispanic                             -0.6549      0.018    -35.957      0.000      -0.691      -0.619
AgeCategory_Age 25 to 29                                                   -5.7152      0.131    -43.517      0.000      -5.973      -5.458
AgeCategory_Age 30 to 34                                                   -5.7736      0.103    -55.876      0.000      -5.976      -5.571
AgeCategory_Age 35 to 39                                                   -5.2732      0.069    -76.302      0.000      -5.409      -5.138
AgeCategory_Age 40 to 44                                                   -5.0994      0.058    -88.679      0.000      -5.212      -4.987
AgeCategory_Age 45 to 49                                                   -4.4927      0.044   -102.160      0.000      -4.579      -4.407
AgeCategory_Age 50 to 54                                                   -4.0281      0.034   -119.062      0.000      -4.094      -3.962
AgeCategory_Age 55 to 59                                                   -3.6575      0.028   -130.392      0.000      -3.713      -3.603
AgeCategory_Age 60 to 64                                                   -3.3886      0.024   -139.027      0.000      -3.436      -3.341
AgeCategory_Age 65 to 69                                                   -3.0787      0.022   -137.410      0.000      -3.123      -3.035
AgeCategory_Age 70 to 74                                                   -2.7478      0.022   -124.357      0.000      -2.791      -2.704
AgeCategory_Age 75 to 79                                                   -2.6979      0.024   -112.564      0.000      -2.745      -2.651
AgeCategory_Age 80 or older                                                -2.3676      0.024    -99.454      0.000      -2.414      -2.321
AlcoholDrinkers_Yes                                                        -0.6844      0.012    -55.688      0.000      -0.708      -0.660
HIVTesting_Yes                                                             -0.4166      0.016    -26.190      0.000      -0.448      -0.385
FluVaxLast12_Yes                                                           -0.1687      0.013    -13.185      0.000      -0.194      -0.144
PneumoVaxEver_Yes                                                           0.2566      0.014     18.699      0.000       0.230       0.284
TetanusLast10Tdap_Yes, received Tdap                                       -1.1001      0.017    -65.768      0.000      -1.133      -1.067
TetanusLast10Tdap_Yes, received tetanus shot but not sure what type        -0.8405      0.014    -60.862      0.000      -0.868      -0.813
TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap                 -1.4708      0.027    -54.175      0.000      -1.524      -1.418
HighRiskLastYear_Yes                                                       -1.2989      0.060    -21.826      0.000      -1.416      -1.182
CovidPos_Tested positive using home test without a health professional     -1.8857      0.068    -27.755      0.000      -2.019      -1.753
CovidPos_Yes                                                               -0.7648      0.016    -48.600      0.000      -0.796      -0.734
Location_Northeast                                                         -1.2580      0.018    -69.188      0.000      -1.294      -1.222
Location_South                                                             -1.2565      0.015    -84.610      0.000      -1.286      -1.227
Location_West                                                              -1.0043      0.016    -62.753      0.000      -1.036      -0.973
===========================================================================================================================================
Final features selected: ['const', 'PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair', 'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good', 'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)', 'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)', 'LastCheckupTime_Within past year (anytime less than 12 months ago)', 'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all', 'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes', 'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes', 'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes', 'HadDiabetes_No, pre-diabetes or borderline diabetes', 'HadDiabetes_Yes', 'HadDiabetes_Yes, but only during pregnancy (female)', 'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes', 'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes', 'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes', 'SmokerStatus_Current smoker - now smokes some days', 'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked', 'ECigaretteUsage_Not at all (right now)', 'ECigaretteUsage_Use them every day', 'ECigaretteUsage_Use them some days', 'ChestScan_Yes', 'RaceEthnicityCategory_Hispanic', 'RaceEthnicityCategory_Multiracial, Non-Hispanic', 'RaceEthnicityCategory_Other race only, Non-Hispanic', 'RaceEthnicityCategory_White only, Non-Hispanic', 'AgeCategory_Age 25 to 29', 'AgeCategory_Age 30 to 34', 'AgeCategory_Age 35 to 39', 'AgeCategory_Age 40 to 44', 'AgeCategory_Age 45 to 49', 'AgeCategory_Age 50 to 54', 'AgeCategory_Age 55 to 59', 'AgeCategory_Age 60 to 64', 'AgeCategory_Age 65 to 69', 'AgeCategory_Age 70 to 74', 'AgeCategory_Age 75 to 79', 'AgeCategory_Age 80 or older', 'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes', 'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap', 'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type', 
'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap', 'HighRiskLastYear_Yes', 'CovidPos_Tested positive using home test without a health professional', 'CovidPos_Yes', 'Location_Northeast', 'Location_South', 'Location_West']
In [77]:
final_features = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair', 'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good', 'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)', 'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)', 'LastCheckupTime_Within past year (anytime less than 12 months ago)', 'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all', 'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes', 'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes', 'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes', 'HadDiabetes_No, pre-diabetes or borderline diabetes', 'HadDiabetes_Yes', 'HadDiabetes_Yes, but only during pregnancy (female)', 'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes', 'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes', 'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes', 'SmokerStatus_Current smoker - now smokes some days', 'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked', 'ECigaretteUsage_Not at all (right now)', 'ECigaretteUsage_Use them every day', 'ECigaretteUsage_Use them some days', 'ChestScan_Yes', 'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes', 'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap', 'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type', 'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap', 'HighRiskLastYear_Yes', 'CovidPos_Tested positive using home test without a health professional', 'CovidPos_Yes', 'Location_Northeast', 'Location_South', 'Location_West']
len(final_features)
Out[77]:
54
In [78]:
coefficients = final_logit_model.params

coef_df = pd.DataFrame(coefficients, columns=['Coefficient']).reset_index()
coef_df.rename(columns={'index': 'Feature'}, inplace=True)

final_features_with_const = ['const'] + final_features  # add the 'const' entry

coef_df = coef_df[coef_df['Feature'].isin(final_features_with_const)]

coef_df = coef_df.sort_values(by='Coefficient', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x='Coefficient', y='Feature', data=coef_df)
plt.title('Feature Importance from Logistic Regression')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.tight_layout()  # adjust the layout so the plot is not clipped
plt.show()
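
The coefficients plotted above are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for a few values copied from the Logit summary. Two hedges apply: the model was fit on SMOTE-resampled data, so these ratios describe this fit and are not population estimates, and each dummy variable is measured against its dropped reference category.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the Logit summary above (log-odds scale).
coefs = pd.Series({
    "HadAngina_Yes": 2.5544,
    "ChestScan_Yes": 0.6380,
    "Sex_Male": 0.3892,
    "SleepHours": -0.1309,
})

# exp(coef) is the multiplicative change in the odds for a one-unit
# increase in the feature, holding the other features fixed.
odds_ratios = np.exp(coefs).rename("odds_ratio")
print(odds_ratios.round(2))
```

In the notebook itself the same transformation would be `np.exp(final_logit_model.params)`. A ratio above 1 raises the odds of `HadHeartAttack_Yes`, and a ratio below 1 lowers them.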

Model Engineering¶

Then move on to LogisticRegression with the refined X (54 features) and the same y as above¶

explain_global()¶

In [79]:
X = df_encoded[final_features]
y = df_encoded['HadHeartAttack_Yes']
X_smote, y_smote = smote.fit_resample(X, y)
In [80]:
X_train , X_test, y_train,  y_test = train_test_split(X_smote, y_smote , test_size=0.2, random_state=0)

scaler= StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

logit_model = LogisticRegression(max_iter=3000, random_state=42)
logit_model.fit(X_train_scaled,y_train)

# Score on the scaled test set, since the model was fit on X_train_scaled.
# Scoring the unscaled X_test instead gave an AUC of 0.838 and a feature-name UserWarning.
auc = roc_auc_score(y_test, logit_model.predict_proba(X_test_scaled)[:, 1])
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))
In [81]:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=0)


logit_model = LogisticRegression(max_iter=3000, random_state=42)
logit_model.fit(X_train, y_train)

y_pred = logit_model.predict(X_test)
y_pred_proba = logit_model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_pred_proba)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy: {:.3f}".format(accuracy))
print("Recall: {:.3f}".format(recall))
print("Precision: {:.3f}".format(precision))
print("F1 Score: {:.3f}".format(f1))
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))
Accuracy: 0.873
Recall: 0.869
Precision: 0.875
F1 Score: 0.872
Logistic Regression AUC on Test Set: 0.944
In [82]:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42)


logit_model = LogisticRegression(random_state=42, max_iter=3000)
logit_model.fit(X_train, y_train)

y_pred = logit_model.predict(X_test)
y_pred_prob = logit_model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_pred_prob)
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))

lr_global = logit_model.explain_global()
show(lr_global)
Logistic Regression AUC on Test Set: 0.945

The interpret LR plot does not render in the HTML or ipynb export.

ExplainableBoostingClassifier() - Explainable Boosting Machine (EBM)¶

Explainable Boosting Machine (EBM) is a machine learning algorithm that combines the principles of traditional gradient boosting with transparent and interpretable modeling techniques. EBM is designed to provide accurate predictions while also offering explanations or interpretations for those predictions, making it particularly useful in domains where understanding the model's reasoning is crucial, such as healthcare or finance.

Key features of EBM include:

Interpretability: EBM constructs models that are inherently interpretable, meaning that the relationships between input features and the predicted outcome are transparent and understandable. This transparency facilitates trust in the model's predictions and helps stakeholders comprehend the factors driving those predictions.

Additive modeling: EBM is a generalized additive model. The prediction is the sum of one learned shape function per feature (plus optional pairwise interaction terms), passed through a link function. Each shape function is built by boosting shallow trees on one feature at a time, cycling through the features in round-robin order with a small learning rate. Because each term depends on a single feature (or a single pair), its contribution can be plotted and read on its own.

Monotonicity constraints: EBM allows users to impose monotonicity constraints on the relationships between input features and the predicted outcome. This means that users can specify whether they expect certain features to have a positive or negative impact on the prediction, thereby aligning the model's behavior with domain knowledge or business requirements.

Global and local explanations: EBM provides both global explanations, which describe the overall behavior of the model across the entire dataset, and local explanations, which explain individual predictions. Local explanations help users understand why a particular prediction was made for a specific instance, offering insights into the model's decision-making process.

Overall, EBM strikes a balance between predictive performance and interpretability, making it a valuable tool for applications where understanding the underlying logic of the model is as important as achieving high accuracy.
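
The additive structure described above can be illustrated with a small round-robin boosting loop. This is a sketch of the idea only, not interpret's implementation, which also bins features, bags models, and detects pairwise interactions. It assumes a regression target on synthetic data for brevity (a classifier would apply a logit link to the summed terms), and names such as `shape_function` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=2000)

learning_rate, n_rounds = 0.1, 100
intercept = y.mean()
trees = [[] for _ in range(X.shape[1])]   # one list of small trees per feature
pred = np.full_like(y, intercept)

for _ in range(n_rounds):
    for j in range(X.shape[1]):            # visit features in round-robin order
        residual = y - pred
        tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
        tree.fit(X[:, [j]], residual)       # the tree sees only feature j
        pred += learning_rate * tree.predict(X[:, [j]])
        trees[j].append(tree)


def shape_function(j, x):
    """Contribution of feature j alone: the sum of its scaled trees."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return learning_rate * sum(t.predict(x) for t in trees[j])


# The prediction decomposes into intercept + one term per feature.
contrib = np.column_stack([shape_function(j, X[:, j]) for j in range(2)])
recon = intercept + contrib.sum(axis=1)
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("max reconstruction error:", np.abs(recon - pred).max())
print("training R^2:", round(r2, 3))
```

Plotting `shape_function(0, grid)` over a grid of values would show how the first feature alone moves the prediction, which is the kind of per-feature curve that `explain_global()` displays for an EBM.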

In [83]:
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)

y_pred = ebm.predict(X_test)
y_pred_proba = ebm.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_pred_proba)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy: {:.3f}".format(accuracy))
print("Recall: {:.3f}".format(recall))
print("Precision: {:.3f}".format(precision))
print("F1 Score: {:.3f}".format(f1))
print("EBM AUC on Test Set: {:.3f}".format(auc))


ebm_global = ebm.explain_global()
show(ebm_global)
Accuracy: 0.880
Recall: 0.879
Precision: 0.881
F1 Score: 0.880
EBM AUC on Test Set: 0.951

The interpret EBM plot does not render in the HTML or ipynb export.

Propensity Score Matching - To-go glass box model¶

We want to see if there's a difference in HadHeartAttack between patients who HadStroke (Group A) and those who didn't (Group B).

  1. Calculate Propensity Scores: We use logistic regression to calculate a propensity score for each patient, representing the estimated probability of HadStroke given their characteristics (e.g., gender, general health, HadAngina).

  2. Match Individuals: We then match patients from Group A to patients in Group B with similar propensity scores. For example, if a patient in Group A has a propensity score of 0.7, we find a patient in Group B with a similar score. The goal is to form pairs or sets of individuals who are similar in their propensity to receive the treatment but differ in whether they actually received it.

  3. Compare Outcomes: With the matched pairs, we can now compare the difference in HadHeartAttack between patients who HadStroke and those who didn't, within each pair.

  4. Assess Results: We analyze the results to determine if there's a significant difference in HadHeartAttack between the two groups after accounting for the propensity score matching.
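
The four steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's pipeline: there is a single made-up confounder x, the logistic regression is fitted with a hand-rolled gradient loop instead of a library model, and matching is 1-nearest-neighbour with replacement on the score.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)  # synthetic confounder: raises both treatment and outcome risk
treat = (rng.random(n) < sigmoid(x - 1.0)).astype(int)
outcome = (rng.random(n) < sigmoid(x - 2.0 + 0.5 * treat)).astype(int)

# Step 1: propensity scores from a logistic regression (plain gradient ascent).
A = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(2000):
    w += 0.5 * A.T @ (treat - sigmoid(A @ w)) / n
score = sigmoid(A @ w)

# Step 2: for each treated unit, find the control unit with the closest score.
t_idx = np.flatnonzero(treat == 1)
c_idx = np.flatnonzero(treat == 0)
order = np.argsort(score[c_idx])
c_sorted = score[c_idx][order]
pos = np.clip(np.searchsorted(c_sorted, score[t_idx]), 1, len(c_sorted) - 1)
take_left = (score[t_idx] - c_sorted[pos - 1]) <= (c_sorted[pos] - score[t_idx])
match_idx = c_idx[order[np.where(take_left, pos - 1, pos)]]

# Step 3: compare outcome rates, before and after matching.
naive_diff = outcome[t_idx].mean() - outcome[c_idx].mean()
matched_diff = outcome[t_idx].mean() - outcome[match_idx].mean()

# Step 4: the matched difference is the quantity to assess.
print(f"Unadjusted difference in outcome rate: {naive_diff:.3f}")
print(f"Difference after matching:             {matched_diff:.3f}")
```

In this simulation the unadjusted difference overstates the effect because treated units have higher x on average; matching on the score removes most of that gap.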

By using propensity score matching, we aim to reduce the influence of confounding variables and obtain a more accurate estimate of the effect of HadStroke on HadHeartAttack.

With the influence of observed confounders reduced, we can be more confident that any remaining difference in outcomes is due to the treatment itself rather than other factors. This holds only to the extent that the relevant confounders are among the covariates in the propensity model; matching cannot adjust for variables that were never measured.

In the cells below, causal.estimates is the table of estimated treatment effects printed by the CausalModel class from the causalinference package after est_via_matching() has been called. It reports the Average Treatment Effect (ATE), the Average Treatment Effect on the Treated (ATT), and the Average Treatment Effect on the Control (ATC). Here's a brief explanation of each:

Average Treatment Effect (ATE): This represents the average causal effect of the treatment on the outcome across the entire population. It provides an estimate of how the outcome variable would change on average if everyone in the population were treated compared to if no one were treated.

Average Treatment Effect on the Treated (ATT): This measures the average causal effect of the treatment on the outcome among individuals who actually received the treatment. It provides insights into how the outcome variable would change for those who received the treatment compared to if they had not received it.

Average Treatment Effect on the Control (ATC): This measures the average causal effect of the treatment on the outcome among individuals who did not receive the treatment. It provides insights into how the outcome variable would change for those who did not receive the treatment compared to if they had received it.

These estimates help researchers understand the impact of the treatment variable on the outcome variable in different subpopulations and provide valuable insights for decision-making and policy formulation. The specific values of ATE, ATT, and ATC obtained from the causal.estimates output will depend on the dataset, the causal modeling approach used, and any assumptions or specifications made during the estimation process.
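
The three quantities are easiest to see in a simulation where both potential outcomes are known for every unit, which is never the case in real data (only one of them is observed per patient, hence the need for estimation). The sketch below uses made-up numbers and hypothetical variable names; it also shows that the ATE is the weighted average of ATT and ATC, with the share of treated units as the weight.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
d = rng.random(n) < 1.0 / (1.0 + np.exp(-x))  # units with high x are treated more often

y0 = 0.10 + 0.02 * x         # potential outcome without treatment
y1 = y0 + 0.05 + 0.03 * x    # potential outcome with treatment (effect grows with x)
effect = y1 - y0             # unit-level effect, observable only in a simulation

ate = effect.mean()          # average over everyone
att = effect[d].mean()       # average over the treated
atc = effect[~d].mean()      # average over the controls
p_treated = d.mean()

print(f"ATE={ate:.4f}  ATT={att:.4f}  ATC={atc:.4f}")
print(f"p*ATT + (1-p)*ATC = {p_treated * att + (1 - p_treated) * atc:.4f}")
```

The same identity explains why an ATE sits close to the ATC whenever the treated group is a small fraction of the sample.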

HadStroke_Yes¶

In [95]:
from sklearn.neighbors import NearestNeighbors

# Step 1: model the probability of treatment (HadStroke) from the other covariates
logit = LogisticRegression()

X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1)
y_propensity = df_encoded['HadStroke_Yes']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores

# Step 2: match each treated patient to the control patient with the closest score
# (1-nearest-neighbour, so one control patient can be matched to several treated patients)
treated = df_encoded[df_encoded['HadStroke_Yes'] == 1]
control = df_encoded[df_encoded['HadStroke_Yes'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

# Step 3: compare heart attack rates within the matched sample
treated_effect = matched_data[matched_data['HadStroke_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadStroke_Yes'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.25
Heart attack incidence rate in the control group: 0.15
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
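
Part of assessing a matched comparison is checking that matching actually balanced the covariates. A common diagnostic is the standardized mean difference (SMD) of each covariate between the treated group and its matched controls; absolute values below roughly 0.1 are often taken as adequate balance. The sketch below uses synthetic data and matches directly on a single covariate for brevity; the function name is hypothetical.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Difference in means divided by the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(7)
x_control = rng.normal(0.0, 1.0, size=5000)  # synthetic covariate in the control pool
x_treated = rng.normal(1.0, 1.0, size=500)   # treated group is shifted upward

# Nearest-neighbour matching (with replacement) on the covariate itself.
pool = np.sort(x_control)
pos = np.clip(np.searchsorted(pool, x_treated), 1, len(pool) - 1)
take_left = (x_treated - pool[pos - 1]) <= (pool[pos] - x_treated)
x_matched = np.where(take_left, pool[pos - 1], pool[pos])

smd_before = standardized_mean_difference(x_treated, x_control)
smd_after = standardized_mean_difference(x_treated, x_matched)
print(f"SMD before matching: {smd_before:.2f}, after matching: {smd_after:.2f}")
```

In this notebook the same function could be applied column by column to treated and matched_control, for example standardized_mean_difference(treated['BMI'], matched_control['BMI']).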

In [96]:
from causalinference import CausalModel

Y = df_encoded['HadHeartAttack_Yes'].values  # Outcome variable
D = df_encoded['HadStroke_Yes'].values  # Treatment variable
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1).values  # Covariates

causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.060      0.013      4.549      0.000      0.034      0.085
           ATC      0.057      0.014      4.219      0.000      0.031      0.084
           ATT      0.113      0.006     18.382      0.000      0.101      0.125
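
The z, P>|z| and confidence interval columns follow from Est. and S.e., assuming the usual normal approximation. As a small worked example, the function below (a hypothetical helper, standard library only) recomputes them for the ATE row from the rounded values 0.060 and 0.013; the results differ slightly from the printed table because the table was computed from unrounded numbers.

```python
import math

def wald_summary(est, se, z_crit=1.96):
    """z statistic, two-sided normal p-value and 95% interval for an estimate."""
    z = est / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value, (est - z_crit * se, est + z_crit * se)

z, p_value, (ci_low, ci_high) = wald_summary(0.060, 0.013)
print(f"z = {z:.3f}, p = {p_value:.2e}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero corresponds to a p-value below 0.05, which is how the rows of the table can be read at a glance.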

In [97]:
# treated.head(5)
In [98]:
# matched_control.head(5)
In [99]:
# matched_data.head(5)
In [ ]:
from causalinference import CausalModel
In [100]:
logit = LogisticRegression()
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1)
y_propensity = df_encoded['HadAngina_Yes']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores

treated = df_encoded[df_encoded['HadAngina_Yes'] == 1]
control = df_encoded[df_encoded['HadAngina_Yes'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

treated_effect = matched_data[matched_data['HadAngina_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadAngina_Yes'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.45
Heart attack incidence rate in the control group: 0.10

In [101]:
Y = df_encoded['HadHeartAttack_Yes'].values  
D = df_encoded['HadAngina_Yes'].values 
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1).values
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.311      0.015     20.751      0.000      0.281      0.340
           ATC      0.307      0.016     19.390      0.000      0.276      0.338
           ATT      0.364      0.007     54.992      0.000      0.351      0.377

In [102]:
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadStroke_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1).values
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.054      0.013      4.093      0.000      0.028      0.080
           ATC      0.052      0.014      3.782      0.000      0.025      0.079
           ATT      0.107      0.006     17.437      0.000      0.095      0.119

In [105]:
df_encoded.shape
Out[105]:
(246022, 72)
In [106]:
#from causalinference import CausalModel

# # Outcome variable
# Y = df_encoded['HadHeartAttack_Yes'].values

# for col in df_encoded.columns.drop('HadHeartAttack_Yes'):
#     # Treatment variable for the current iteration
#     D = df_encoded[col].values
    
#     # Covariates excluding the current treatment variable and the outcome
#     X = df_encoded.drop(['HadHeartAttack_Yes', col], axis=1).values
    
#     # Initialize and estimate the causal model
#     causal = CausalModel(Y, D, X)
#     causal.est_propensity()
#     causal.est_via_matching()
    
#     # Print the causal estimates for the current treatment variable
#     print(f"Causal estimates for treatment variable '{col}':")
#     print(causal.estimates)
In [107]:
df_encoded.columns
Out[107]:
Index(['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
       'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair',
       'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good',
       'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)',
       'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)',
       'LastCheckupTime_Within past year (anytime less than 12 months ago)',
       'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all',
       'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes',
       'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes',
       'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes',
       'HadDiabetes_No, pre-diabetes or borderline diabetes',
       'HadDiabetes_Yes',
       'HadDiabetes_Yes, but only during pregnancy (female)',
       'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes',
       'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes',
       'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes',
       'SmokerStatus_Current smoker - now smokes some days',
       'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked',
       'ECigaretteUsage_Not at all (right now)',
       'ECigaretteUsage_Use them every day',
       'ECigaretteUsage_Use them some days', 'ChestScan_Yes',
       'RaceEthnicityCategory_Hispanic',
       'RaceEthnicityCategory_Multiracial, Non-Hispanic',
       'RaceEthnicityCategory_Other race only, Non-Hispanic',
       'RaceEthnicityCategory_White only, Non-Hispanic',
       'AgeCategory_Age 25 to 29', 'AgeCategory_Age 30 to 34',
       'AgeCategory_Age 35 to 39', 'AgeCategory_Age 40 to 44',
       'AgeCategory_Age 45 to 49', 'AgeCategory_Age 50 to 54',
       'AgeCategory_Age 55 to 59', 'AgeCategory_Age 60 to 64',
       'AgeCategory_Age 65 to 69', 'AgeCategory_Age 70 to 74',
       'AgeCategory_Age 75 to 79', 'AgeCategory_Age 80 or older',
       'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes',
       'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap',
       'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type',
       'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap',
       'HighRiskLastYear_Yes',
       'CovidPos_Tested positive using home test without a health professional',
       'CovidPos_Yes', 'HadHeartAttack_Yes', 'Location_Northeast',
       'Location_South', 'Location_West', 'propensity_score'],
      dtype='object')
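
The column listing above includes propensity_score, which an earlier cell wrote into df_encoded. Any later call of the form df_encoded.drop([outcome, treatment], axis=1) therefore carries the previous treatment's score into the next covariate set, which may be why rerunning the same CausalModel cell gives slightly different estimates. One way to guard against this is to build the covariate list explicitly. The helper below is a hypothetical sketch that works on plain column names; the derived-column name is taken from this notebook.

```python
def covariate_columns(columns, treatment, outcome, derived=("propensity_score",)):
    """Column names to use as covariates: everything except the treatment,
    the outcome, and columns derived during earlier analysis steps."""
    excluded = {treatment, outcome, *derived}
    return [c for c in columns if c not in excluded]

# Small stand-in for df_encoded.columns.
cols = ["BMI", "HadStroke_Yes", "HadAngina_Yes", "HadHeartAttack_Yes", "propensity_score"]
stroke_covariates = covariate_columns(cols, "HadStroke_Yes", "HadHeartAttack_Yes")
print(stroke_covariates)  # ['BMI', 'HadAngina_Yes']

# Assumed usage in this notebook (not run here):
# X_propensity = df_encoded[covariate_columns(df_encoded.columns,
#                                             'HadStroke_Yes', 'HadHeartAttack_Yes')]
```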

HadAngina_Yes¶

In [108]:
logit = LogisticRegression()
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1)
y_propensity = df_encoded['HadAngina_Yes']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores


treated = df_encoded[df_encoded['HadAngina_Yes'] == 1]
control = df_encoded[df_encoded['HadAngina_Yes'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

treated_effect = matched_data[matched_data['HadAngina_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadAngina_Yes'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.45
Heart attack incidence rate in the control group: 0.09

In [109]:
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadAngina_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1).values  # Covariates
In [110]:
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.313      0.015     20.940      0.000      0.283      0.342
           ATC      0.309      0.016     19.578      0.000      0.278      0.340
           ATT      0.364      0.007     55.012      0.000      0.351      0.377

HadDiabetes_Yes¶

In [111]:
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadDiabetes_Yes'], axis=1)
y_propensity = df_encoded['HadDiabetes_Yes']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores

treated = df_encoded[df_encoded['HadDiabetes_Yes'] == 1]
control = df_encoded[df_encoded['HadDiabetes_Yes'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

treated_effect = matched_data[matched_data['HadDiabetes_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadDiabetes_Yes'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.14
Heart attack incidence rate in the control group: 0.11

In [112]:
Y = df_encoded['HadHeartAttack_Yes'].values  # Outcome variable
D = df_encoded['HadDiabetes_Yes'].values  # Treatment variable
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadDiabetes_Yes'], axis=1).values  # Covariates
In [113]:
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.016      0.005      3.027      0.002      0.006      0.026
           ATC      0.012      0.006      2.113      0.035      0.001      0.024
           ATT      0.037      0.003     12.814      0.000      0.031      0.042

SmokerStatus_Former smoker¶

In [114]:
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'SmokerStatus_Former smoker'], axis=1)
y_propensity = df_encoded['SmokerStatus_Former smoker']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores

treated = df_encoded[df_encoded['SmokerStatus_Former smoker'] == 1]
control = df_encoded[df_encoded['SmokerStatus_Former smoker'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

treated_effect = matched_data[matched_data['SmokerStatus_Former smoker'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['SmokerStatus_Former smoker'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.08
Heart attack incidence rate in the control group: 0.10
In [ ]:
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['SmokerStatus_Former smoker'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'SmokerStatus_Former smoker'], axis=1).values  # Covariates

causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)

HadKidneyDisease_Yes¶

In [ ]:
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadKidneyDisease_Yes'], axis=1)
y_propensity = df_encoded['HadKidneyDisease_Yes']

logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]

df_encoded['propensity_score'] = propensity_scores

treated = df_encoded[df_encoded['HadKidneyDisease_Yes'] == 1]
control = df_encoded[df_encoded['HadKidneyDisease_Yes'] == 0]

nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])

distances, indices = nn.kneighbors(treated[['propensity_score']])

matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])

treated_effect = matched_data[matched_data['HadKidneyDisease_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadKidneyDisease_Yes'] == 0]['HadHeartAttack_Yes'].mean()

print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
In [96]:
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadKidneyDisease_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadKidneyDisease_Yes'], axis=1).values  # Covariates
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
--------------------------------------------------------------------------------
           ATE      0.005      0.010      0.538      0.591     -0.014      0.025
           ATC      0.004      0.010      0.404      0.686     -0.016      0.024
           ATT      0.030      0.005      5.829      0.000      0.020      0.040